DOCUMENT RESUME

ED 083 284                                              TM 003 246

AUTHOR         Klein, Stephen P.; Kosecoff, Jacqueline
TITLE          Issues and Procedures in the Development of Criterion
               Referenced Tests.
INSTITUTION    ERIC Clearinghouse on Tests, Measurement, and
               Evaluation, Princeton, N.J.
SPONS AGENCY   National Inst. of Education (DHEW), Washington, D.C.
REPORT NO      TM-R-26
PUB DATE       Sep 73
NOTE           18p.

EDRS PRICE     MF-$0.65 HC-$3.29
DESCRIPTORS    *Criterion Referenced Tests; Elementary Grades;
               *Mathematics; Secondary Grades; *Test Construction;
               *Testing Programs; *Tests

ABSTRACT
     The basic steps and procedures in the development of
criterion referenced tests (CRT), as well as the issues and problems
associated with these activities, are discussed. In the first section
of the paper, the discussions focus upon the purpose and defining
characteristics of CRTs, item construction and selection, improving
item quality, content validity, item and test bias, test scores, and
packaging and other considerations. In the second section, the
results of a survey conducted to assess current efforts in criterion
referenced testing are summarized. Five defining characteristics
(program focus, instructional dependence, objective and item
generation, test models and packaging, and test scores) are provided
for each of the following testing programs: California Test
Bureau/McGraw-Hill, Prescriptive Mathematics Inventory;
Comprehensive Achievement Monitoring; Individualized Criterion
Referenced Testing; Instructional Objectives Exchange; MINNEMAST
Curriculum Project, University of Minnesota; National Assessment of
Educational Progress; Southwest Regional Laboratory; System for
Objectives Based Assessment: Reading, Center for the Study of
Evaluation, UCLA; and Zweig and Associates. From this analysis, 10
questions that the CRT developer must answer in order to clarify the
nature and purpose of a CRT are provided. (DB)


ERIC CLEARINGHOUSE ON TESTS, MEASUREMENT, & EVALUATION
EDUCATIONAL TESTING SERVICE, PRINCETON, NEW JERSEY 08540

Conducted by Educational Testing Service in Association with Rutgers University Graduate School of Education



REPORT 26



ISSUES AND PROCEDURES IN THE DEVELOPMENT OF 
CRITERION REFERENCED TESTS 

Stephen P. Klein 
Jacqueline Kosecoff 



SEPTEMBER 1973 



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION






PREFACE 



A visitor to our planet Earth surveying the current state of
educational testing would very likely be confused by what
he found. He would observe, for example, the increasing
use of tests in all phases and facets of the educational
process including the evaluation of instructional personnel.
He would learn, too, about the great technological
improvements that have been made in tests and in their
administration, scoring, and reporting procedures. All of these
factors would tend to support the notion that tests are fulfilling
an important and vital role. On the other hand, this same
observer might also hear the valid complaints of the
growing cadre of test critics. These critics complain that
present tests are inappropriate for most educational decision
making and, if a test is not going to be used for
decision making, why bother giving it in the first place?

Perhaps one of the quickest ways of alleviating our
visitor's confusion is to point out to him certain changes
that have been occurring in education and testing during
the past few years. For example, most expert test
construction in the past has focused upon a relatively few kinds
of assessment instruments, such as those that are used to
decide whether a student should be accepted for college.
Comparatively little help has been given to the classroom
teacher to diagnose individual student needs or assess the
outcomes of particular instructional programs. Now,
however, there is growing desire to individualize instruction, to
assess validly the outcomes of instructional programs, and
to hold teachers and administrators responsible for actual
gains in student performance. These trends have increased
the demand on test developers for appropriate tools to
facilitate the measurement process, because existing
measures are useful for some important educational
decisions but are not designed to meet all needs. It is evident,
therefore, that test critics are complaining not about tests
per se, but about the need for certain kinds of quality
measures that are not currently available.

It is within this context of increased need for and reliance
on valid test results that the movement towards so-called
"criterion referenced tests" (CRT) has been given new
impetus. A criterion referenced measure is essentially "one
that is deliberately constructed so as to yield measurements
that are directly interpretable in terms of specified
performance standards"¹ (Glaser & Nitko, 1971). The
pertinent question is whether or not the individual has
attained some significant degree of competence on an
instructional performance task (Harris, 1972).

Measures with these characteristics are, of course, not
new to education. What is new is the range of importance
of the decision areas for which they are being employed or
emphasized and the attention they are being given by
measurement and curriculum experts alike (Airasian &
Madaus, 1972; Baker, E., 1972; Keller, 1972; Davis, 1972,
1973; Hawes, 1973). It would not be surprising, therefore,
for us to witness during the next few years a number of
major contributions to testing theory and methodology
arising from the use of criterion referenced tests. Further,
the improvement of such measures is likely to have many
ramifications for instructional practice, since with improved
tools even more reliance is likely to be placed upon the
results obtained. For example, a bill is now pending before
the United States Congress that would require criterion
referenced test data in order to make funding decisions
affecting thousands of schools and involving several million
dollars (Quie, 1973).

¹These performance standards are usually behaviorally stated, for
example: "The student will be able to perform all fundamental
mathematical operations involving single-digit integers."

This publication was prepared pursuant to a contract with the National Institute of Education, U. S. Department of Health, Education and
Welfare. Contractors undertaking such projects under government sponsorship are encouraged to express freely their judgment in professional
and technical matters. Points of view or opinions do not, therefore, necessarily represent official National Institute of Education position or policy.

It is appropriate at this point in time, therefore, for us to
examine how criterion referenced tests are constructed and,
more importantly, the basic issues and procedures
associated with these steps. It is hoped that such an appraisal
will clarify some of the basic methodological and
theoretical concerns associated with criterion referenced tests that
will be examined during the next few years.

This paper is divided into two parts. In the first section
the major issues and steps in the development of CRTs are
considered. In the second section representative CRT
systems in mathematics, as well as important efforts in
other content areas, have been selected for review.²

²The intended focus of this paper was to be CRTs in mathematics;
however, a review of the relevant literature disclosed relatively few
references dealing exclusively with this field. Further, those articles
pertaining specifically to mathematics mainly describe the
development of particular instruments in certain contexts. They do not
consider the general concerns associated with the development and
use of CRTs in mathematics nor do they examine the vast array of
situations for which they might be applicable. Therefore, it was
decided to focus this paper on concerns central to CRTs in general,
with special emphasis and examples coming from mathematics.



MAJOR ISSUES AND STEPS IN CRT DEVELOPMENT 



This section of the paper provides a review of the basic
steps in the development of CRTs and the major issues
associated with these steps. Although many of the steps and
issues have their counterpart in classical test development,
the present focus is upon those considerations unique to
CRTs and especially those relating to the development of
such measures in mathematics. It should be kept in mind,
however, that the method chosen to resolve a particular
issue at one stage in the development of a CRT is likely to
have ramifications for other stages in the developmental
process as well as in the interpretation of the scores
obtained. In addition, the most important but not
necessarily self-evident of these implications are noted, and the
primary techniques and procedures that have been used as
well as their most important advantages and limitations are
identified.



Purpose and Defining Characteristics of CRTs

It is a generally accepted principle that somewhat different
kinds of measures have to be constructed for different
purposes. This principle also appears to carry over into the
development of CRTs. For example, to ensure an adequate
level of test reliability, a CRT or series of CRTs that will be
used in making a decision about an individual's level of
performance will need to be longer than one used for group
assessment. Similarly, the focus of the CRTs used for
managing an individualized learning segment of a small
mathematics unit would be narrower than that used to
measure end-of-year performance of all students in a
classroom. The characteristics of the target audience, such as
their ages and ethnic backgrounds, are also likely to
influence the test construction process in terms of the
appropriateness of various kinds of stimuli and response
formats. Further, the anticipated number of students to be
tested and the context in which the testing will occur
influence test format, production, distribution,
administration, scoring, and analysis.
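
The link between test length and reliability can be made concrete with a small sketch. It assumes the classical Spearman-Brown relationship, which the paper does not name; the function names and the 5-item, 0.50-reliability starting point are hypothetical figures chosen only for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when a test is made length_factor times as long.

    Assumes the added items behave like the existing ones (an assumption of
    the classical formula, not a claim made in the paper).
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability: float, target: float) -> float:
    """How many times longer the test must be to reach the target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))


# Hypothetical example: a 5-item CRT whose scores have reliability 0.50.
current_items = 5
current_reliability = 0.50
factor = length_factor_needed(current_reliability, target=0.80)
items_needed = round(factor * current_items)
print(items_needed)  # 20 items to reach a reliability of 0.80
```

Under these assumptions a measure adequate for group assessment would have to be roughly four times as long before it supported decisions about an individual student.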

Figure 1 lists some of the basic purposes that have been
noted for using CRTs in terms of the decisions to be made
and the focus of the testing (Harris, 1973; Skager, 1973).
Three major kinds of decisions have been identified.
Decisions relating to the organization of an instructional
program are classified as planning decisions. Validating the
quality and competency of a program is encompassed by
certification decisions. Decisions based on additional
investigation of the instructional program are included in a
research category. With respect to the focus of the testing
program, three classifications are considered. First, a CRT
can be primarily involved with the individual student.
Second, groups of students such as a classroom or ethnic
group can be the focus. And third, the instructional
program itself might be the primary unit of concern.

Figure 2 illustrates how differences in the target audience
would result in different test items for the same objective.
From an inspection of these figures and the foregoing
discussion it is apparent that the different uses of CRTs
may require different kinds of measures and test models.
The fundamental issue underlying these differences is the
degree to which the CRT or set of CRTs will provide
precise and reliable information about student performance
relative to various feasibility constraints associated with
gathering this information, such as costs and testing time.



Figure 1. Purposes for Criterion Referenced Tests

(Rows: focus of the testing program. Entries: type of decision.)

FOCUS: Student
   PLANNING        Diagnosis, prediction, and placement
   CERTIFICATION   Determination of "mastery," grades, and
                   success of placement
   RESEARCH        Interactions among the student, the group,
                   and the program

FOCUS: "Group" (classroom, ethnic, SES, cultural, or
geographical groups)
   PLANNING        Classroom management
                   Curriculum selection
   CERTIFICATION   Instructional and administrative accountability
   RESEARCH        Interactions between group(s) and program,
                   e.g., do students with certain characteristics
                   function better than others in a given situation?

FOCUS: "Program" (a program may be used with one or more groups)
   PLANNING        Organization and sequencing of instruction
                   Curriculum and product development
                   Needs assessment
   CERTIFICATION   Program evaluation
                   Analysis of subject matter domain
   RESEARCH        Comparisons between types of programs
                   Analysis of program components
                   Development of measurement methodology



Figure 2. Comparison of General Item Formats for the Same Objective at Different Grade or Age Levels

OBJECTIVE

The student will indicate by marking the appropriate choices on an attitude scale his/her appreciation of the importance of
mathematics in everyday life.

[YOUNGER STUDENTS]

Format.
The student is given a test booklet. Each page is a
different color and has a familiar symbol at the top of
the page, such as a rabbit. Each page also has the
words "Yes" and "No." Directions are provided to
the student so that he/she understands to mark the
choice that answers the question that is read by the
teacher.

Sample Items.
The teacher reads the following kinds of directions
and questions: "Now turn to the red page with the
rabbit at the top. . . . Now I am going to read you the
next question. 'Do you have to know how to work
with numbers to tell time?' . . . Now turn to the
yellow page with the duck at the top. . . . 'Do you
have to know how to add and subtract numbers to
catch a ball?' . . . Now turn to the page with the
table at the top. . . . 'Do you have to know how to
work with numbers to buy something at the
store?' . . ." and so forth.

Comments.
Note that the child does not have to read the
questions, the questions are asked about him or herself
rather than some other person, and that the language
level and activities are within the students' repertoire
of experiences.

TWELFTH GRADERS

Format.
The student is given a set of statements and a series of
choices ranging from "Strongly Agree" to "Strongly
Disagree." The student marks the number of his
choice on a machine-scorable answer sheet.

Sample Items.
The following kinds of items might appear on a scale
to measure the objective:

1. Persons who fill medical prescriptions need to use
   mathematics frequently in their work.

2. Only a very small part of a carpenter's job requires
   him to use mathematics.

3. It is more important for a bank teller to make
   friends easily than it is for him or her to make
   arithmetic computations accurately.

4. In order to be a good plumber, one would have to
   be able to do basic arithmetic computations with
   fractions.

Comments.
The statements are balanced with respect to being
positive or negative regarding the importance of
mathematics so as to reduce any irrelevant tendency
to agree or disagree.



Objectives Chosen

As noted in the preface to this paper, one of the essential
features of CRTs is their foundation in clearly defined
educational objectives. There are, however, a number of
issues associated with how these objectives should be
developed and stated. The essence of these issues may be
summarized by the question: "What kinds of objectives
should form the basis for a CRT system?"

Almost all developers of CRTs agree that to assess
performance within a given area requires the construction of a
set of CRTs rather than a single measure. The problem then
arises as to which objectives within an area should become
the basis for the CRTs and how broadly or narrowly these
objectives should be stated, that is, the extent of each
objective's coverage. The statement of an objective may be
further delineated by defining the conditions under which
the measurements are made (e.g., open vs. closed book,
with or without the aid of a sheet containing needed
formulas, and so forth) and/or the standards of
performance to be reached in order for the objective to be achieved
(e.g., "80 percent correct," "in less than 2 minutes," and
so forth) (Mager, 1962; Popham, 1965). Implicit or
explicit assumptions about the relative importance of the
objectives and the characteristics of the area to be assessed
(such as the logical and/or sequential organization of the
objectives in it) also influence decisions as to which
objectives should form the basis for a CRT system (Popham,
1972).

The resolution of the issues associated with choosing a set
of objectives usually hinges upon the anticipated purpose(s)
of the CRT system. Thus, there is a consideration of the
degree of precision needed relative to various practical
considerations. This balance is illustrated by the IOX
Criteria for Objective Selection (Popham, 1972) presented
in the Appendix.
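
As an illustration of how an objective, its measurement conditions, and its performance standards fit together, the sketch below records them in one structure. The class name, field names, and the scoring rule are hypothetical; the paper describes these components but does not prescribe a representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Objective:
    """One behaviorally stated objective with its conditions and standards."""
    statement: str
    conditions: List[str] = field(default_factory=list)  # e.g. "closed book"
    min_proportion_correct: float = 0.80                  # "80 percent correct"
    max_minutes: Optional[float] = None                   # "in less than 2 minutes"

    def is_achieved(self, n_correct: int, n_items: int, minutes_used: float) -> bool:
        """Apply the accuracy standard and, if one is set, the time standard."""
        if n_items <= 0:
            raise ValueError("n_items must be positive")
        meets_accuracy = n_correct / n_items >= self.min_proportion_correct
        meets_time = self.max_minutes is None or minutes_used < self.max_minutes
        return meets_accuracy and meets_time


multiplication = Objective(
    statement="Compute the product of two single digit numerals greater than 0 "
              "where the product does not exceed 20.",
    conditions=["closed book", "no formula sheet"],
    min_proportion_correct=0.80,
    max_minutes=2,
)
print(multiplication.is_achieved(n_correct=4, n_items=5, minutes_used=1.5))  # True
print(multiplication.is_achieved(n_correct=4, n_items=5, minutes_used=2.5))  # False
```

Stating the standards explicitly in this way makes it visible that changing either threshold changes what "achieving the objective" means.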

Some of the procedures that have been used to develop
the objective bases for CRT systems are described briefly
below:

1. Expert Judgment. A small group of experts within the
area to be assessed meet and, on the basis of their
knowledge and experience in the field, jointly decide which
objectives are the most important to measure. These
objectives are then screened to determine the feasibility of
measuring them and, where necessary, to clarify and/or
redefine them. This is probably the most common
approach.

2. Consensus Judgment. Various groups such as
community representatives, curriculum experts, teachers, and
school administrators decide which objectives they consider
to be the most important. A measurement and/or
curriculum expert is then responsible for defining and stating
these objectives in a way that would permit them to be
assessed (Klein, 1972; Wilson, 1973).

3. Curriculum Analysis. A team of curriculum experts
analyzes a given set of curriculum materials such as
textbooks in order to identify, and where necessary infer, the
objectives that are the focus of these materials (Baker, R.,
1972).

4. Analysis of the Area to be Tested. An in-depth analysis
is made of an area such as mathematics in order to identify
all contents (such as single-digit numerals) and behaviors
(such as multiplication with replacement) that are included
in that area (Glaser & Nitko, 1971; Nitko, 1973). The
objectives associated with these contents and behaviors are
then organized in some systematic fashion, such as in terms
of a hierarchy and/or sequence of objectives for the
components of the subject area (in mathematics usually
referred to as "strands") (Nitko, 1971; Roudabush, 1971;
Popham, 1972).



Item Construction and Selection

Once the purpose(s) and the objectives for the CRT system
have been delineated, the next step is to construct and/or
select test items or tasks to measure the objectives chosen.
This is one of the most difficult steps in the total
developmental process because of the vast number of test items or
tasks that might be constructed for any given objective,
even those that are relatively narrowly defined. For
example, consider the following objective: "The student
can compute the correct product of two single digit
numerals greater than 0 where the maximum value of this
product does not exceed 20." The specificity of this
objective is quite deceptive since there are 29 pairs of numerals
that meet this requirement and at least 10 different item
types that might be used to assess student performance (see
Figure 3). Further, each of the resulting 290 combinations
of pairs and item types could be modified in a variety of
ways that might influence whether the student answered
them correctly. Some of these modifications are:

• Vary the sequence of numerals (e.g., 5 then 3 versus 3
then 5).

• Use different item formats (e.g., multiple choice versus
completion).

• Change the mode of presentation (e.g., written versus
oral).

• Change the mode of response (e.g., written versus oral).

It soon becomes evident that even a highly specific
objective could have a potential item pool of well over several
thousand items (Hively, 1970, 1973; Bormuth, 1970).
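
A minimal sketch of how quickly such a pool grows. The enumeration rule used here (digits 1 through 9, product at most 20) is this sketch's reading of the objective; under it there are 23 unordered and 42 ordered pairs, so the figure of 29 cited above presumably rests on a counting convention that the text does not spell out. The `pool_size` helper and its parameter names are hypothetical.

```python
from itertools import product

DIGITS = range(1, 10)   # single digit numerals greater than 0
MAX_PRODUCT = 20

# Ordered pairs treat "5 then 3" and "3 then 5" as different items.
ordered_pairs = [(a, b) for a, b in product(DIGITS, repeat=2) if a * b <= MAX_PRODUCT]
# Unordered pairs count each combination of numerals once.
unordered_pairs = [(a, b) for a, b in ordered_pairs if a <= b]


def pool_size(n_pairs, n_item_types, n_formats=1, n_presentation_modes=1,
              n_response_modes=1):
    """Number of distinct items if every listed variation is crossed with every other."""
    return n_pairs * n_item_types * n_formats * n_presentation_modes * n_response_modes


print(len(ordered_pairs), len(unordered_pairs))   # 42 23
print(pool_size(29, 10))                           # 290, the count used in the text
# Two formats, two presentation modes, two response modes:
print(pool_size(29, 10, n_formats=2, n_presentation_modes=2, n_response_modes=2))  # 2320
```

Each additional dimension of variation multiplies the pool, which is why even a narrow objective reaches the thousands once formats and modes are crossed.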

Figure 3. Item types using the content of numerals 3 and 5 for the objective:

The student can compute the correct product of two single digit numerals greater than 0 where the maximum value of this
product does not exceed 20.

a.    5
     x3

b. 5 x 3 =

c. (5)(3) =

d. 5 · 3 =

e. 5 times 3 =

f. The product of 5 and 3 =

g. 5 x ___ = 15

h. If x=5 and y=3, what is the value of xy?

i. What numeral multiplied by 3 will equal 15?

j. John has 5 apples. Sally has 3 times as many apples as
   John. How many apples does Sally have?

The number of items to construct for each objective is
influenced by several factors. Some of these factors are the
amount of testing time available and the cost of making an
interpretation error, such as saying that a student has
achieved mastery when he has not. A survey of current
measures reveals that the usual practice is to use about
three to five items per objective. This practice appears to
stem more from feasibility constraints than any sound
foundation in psychometric theory or technology.
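
The cost of an interpretation error can be illustrated with a rough calculation. The sketch below assumes a simple binomial model (each item answered independently with the same probability of success), which is an assumption of this example rather than a model proposed in the paper; the function name and the cutoffs are hypothetical.

```python
from math import comb


def prob_reaching_cutoff(n_items: int, min_correct: int, p_correct: float) -> float:
    """Probability of answering at least min_correct of n_items correctly.

    Binomial model: independent items, each answered correctly with
    probability p_correct.
    """
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


# A student who answers only half of such items correctly (a non-master)
# is still declared a "master" this often under an 80 percent standard:
print(prob_reaching_cutoff(5, 4, 0.5))    # 0.1875 with 5 items
print(prob_reaching_cutoff(10, 8, 0.5))   # 0.0546875 with 10 items
```

Under these assumptions, a five-item test misclassifies such a student nearly one time in five, which suggests why three to five items per objective is hard to justify on psychometric grounds alone.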

The particular item construction and selection approach
or combination of approaches chosen to define a CRT
program is a major consideration. One reason for this is that
the methods used have a direct bearing on the utility and
content validity of the CRTs developed and the
interpretation of their scores. For example, if there is a hierarchy of
objectives and if a CRT is to be based on an objective at a
given level of generality in this hierarchy, then it is likely
that the items used will be sampled from the relevant
subobjectives. Unless there is a specified hierarchy or an
organization of objectives, such systematic sampling is
impossible. When this latter situation occurs, one has much
less confidence that the measure(s) developed really assess
the whole objective. One reason for this concern is that
without a systematic plan for guidance, it is very easy to
just construct items for those aspects of an objective that
are most amenable to measurement rather than those
aspects that might be considered most germane or critical.
On the other hand, it also seems likely that responsible test
developers working without an overall plan are more likely
to focus their attention on the most salient (and perhaps
most frequently taught) facets of an objective than on
those aspects that may be just tangential to what a student
must really know or be able to do. Thus, the best
compromise between systematic sampling (and thereby
improved content validity) and potential instructional
relevance is to first develop a provisional systematic plan
and then assign items to some or all the components of this
plan based upon their perceived relative importance. This
latter approach is the one most frequently adopted by
major test publishers (Wood, 1961).
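
One way to picture the compromise described above is a proportional allocation of a fixed number of items across the components of a provisional plan. The sketch below is only an illustration: the component names and importance weights are hypothetical, and the largest-remainder rounding rule is a choice made for this example, not a procedure taken from the paper.

```python
from typing import Dict


def allocate_items(importance: Dict[str, float], total_items: int) -> Dict[str, int]:
    """Split total_items across components in proportion to their importance.

    Uses largest-remainder rounding so the allocations sum to total_items.
    """
    total_weight = sum(importance.values())
    if total_weight <= 0:
        raise ValueError("importance weights must sum to a positive number")
    exact = {name: total_items * w / total_weight for name, w in importance.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = total_items - sum(allocation.values())
    # Hand the remaining items to the components with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical plan for the multiplication objective.
plan = {"basic facts": 3, "word problems": 2, "missing factor": 1}
print(allocate_items(plan, total_items=10))
# {'basic facts': 5, 'word problems': 3, 'missing factor': 2}
```

Setting a weight to zero drops a component entirely, which corresponds to assigning items to "some" rather than "all" components of the plan.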

A related issue in construction and selection of CRT
items is the degree to which the items should be sampled
with respect to their relative difficulty within an objective.
It is a well known and frequently used principle of test
construction that slight changes in an item can affect its
difficulty. This is most readily accomplished by varying the
homogeneity of the alternatives in a multiple choice item,
such as in the two examples below:



Eight hundredths equal

a. 800

b. 80

c. 8

d. .08



Eight hundredths equal

a. .800

b. [illegible]

c. .08

d. .008



The extent to which the items within an objective are
sampled with respect to difficulty has, of course, a direct
bearing on the interpretation of the scores obtained. In
other words, if only the most difficult items are used, then
the phrase "mastery of the objective" has a very different
meaning than if the items were sampled over the full range
of difficulties. The fact that the difficulties of items on
CRTs (and thus their scores) can be influenced so easily
poses real problems to CRT users. To blindly assume that
the scores obtained indicate an accurate appraisal of the
degree of mastery achieved, merely because a measure is
called a "CRT," is an exercise in self-deception.
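
The effect of difficulty sampling on a "mastery" decision can be shown with simple arithmetic. In the sketch below the item difficulties (proportion of students expected to answer correctly) and the 70 percent cutoff are hypothetical values chosen for illustration.

```python
def expected_percent_correct(p_values):
    """Expected percent score on a test built from items with these difficulties."""
    return 100 * sum(p_values) / len(p_values)


# Hypothetical difficulties for six items written for one objective.
item_pool = [0.95, 0.90, 0.80, 0.70, 0.55, 0.40]
hardest_three = sorted(item_pool)[:3]      # the three lowest p-values

CUTOFF = 70  # hypothetical mastery standard, in percent

full_range_score = expected_percent_correct(item_pool)
hard_only_score = expected_percent_correct(hardest_three)
print(round(full_range_score, 1), full_range_score >= CUTOFF)
print(round(hard_only_score, 1), hard_only_score >= CUTOFF)
```

With these figures the same students would be expected to clear the cutoff on the full-range test and fall short on the hard-items-only test, even though both are called measures of the same objective.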

A third consideration influencing the construction and
selection of items is the degree to which an item is
dependent upon or related to a particular set of curriculum
or instructional materials and techniques (Baker, R., 1972;
Skager, 1973). For example, if the instruction only gave
students practice in solving multiplication problems in the
form used in item types a-e in Figure 3, and if the CRT for
this unit only used these same item types, then the CRT
would be said to be "instructionally dependent," or biased.
It is readily apparent that the more instructionally
dependent the CRT, the more likely the effects of
instruction would be evidenced in the scores obtained with it and
the less generality one could draw from these scores
regarding the student's mastery of the objective. On the
other hand, instructionally independent tests are more
likely to reflect a student's general ability. Thus, an
instructionally biased test might be preferred for such
purposes as teacher accountability, while an instructionally
independent test might be preferred for school
accountability and for evaluation studies comparing the effects of
different programs.

A fourth issue, and one which has perhaps not received as
much attention as it should, is the potential interaction
between the objective and how it is measured. It is often
assumed, for example, that selected response items (e.g.,
multiple choice) serve as an effective proxy for constructed
response items (e.g., completion or short answer) because
the performance of students on the two kinds of items are
highly related. Although this may be generally true, it may
not be true for certain kinds of objectives; and further, the
degree of mastery required to answer a constructed
response item is usually greater than it is to answer the
selected response item. The relative scoreability of the
latter format, however, has led to its use almost exclusively
in published measures, including CRTs. It should be
recalled that anything affecting item difficulty on a CRT
will influence the total score on it and thereby the
interpretation of that score.

The foregoing considerations have led to a number of
different methods of selecting and constructing items for
CRTs. The general features of these methods are described
below, but it should be remembered that each of these
approaches begins with or involves the development of
well-defined statements of the educational objective(s) to
be measured.

1. Panel of Experts. A group of measurement and
curriculum "experts" decide which items to use based on
their knowledge and experience of the field (Zweig,
1973). When the experts involved are classroom
teachers, this approach may lead to highly
instructionally dependent measures.

2. Systematic Sampling. This approach is basically a
variation of the classical test construction technique. It
involves developing for each objective a matrix of
contents and behaviors (or tasks) to be assessed. Items
are then systematically sampled within this matrix and
perhaps along a third continuum of item difficulty as
well (Wilson, 1973; CTB/McGraw-Hill, 1973).

3. Systematic Item Generation. This is the most
sophisticated of the various approaches and starts with the
assumption that all the relevant contents, behaviors (or
tasks), stimulus and response characteristics, and related
factors can be defined for a given domain or universe of
objectives (Hively, 1970, 1973; Cronbach, 1971; Skager,
1973). Basic item forms or "shells" are then
constructed. Various techniques can then be used to
generate the necessary items, including the use of a
computer (especially in the field of mathematics) to
meet certain prespecified criteria for coverage of the
objectives (Kriewall & Hirsch, 1969).
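
The third approach lends itself to a short illustration. The sketch below fills a handful of item "shells" modeled on Figure 3 with every numeral pair allowed by the multiplication objective and then draws a test form at random. The shell wording, the dictionary layout, and the function names are hypothetical; they are not the procedures of the systems cited above.

```python
import random
from itertools import product

# Hypothetical item forms ("shells") patterned after Figure 3.
ITEM_FORMS = [
    "{a} x {b} = ___",
    "{a} times {b} = ___",
    "The product of {a} and {b} = ___",
    "If x={a} and y={b}, what is the value of xy?",
]


def generate_items(max_product=20):
    """Fill every shell with every ordered pair of digits 1-9 whose product
    does not exceed max_product, recording the keyed answer for each item."""
    items = []
    for a, b in product(range(1, 10), repeat=2):
        if a * b <= max_product:
            for form in ITEM_FORMS:
                items.append({"stem": form.format(a=a, b=b), "key": a * b})
    return items


def sample_test(items, n_items, seed=0):
    """Draw n_items at random, without replacement, to form one test."""
    rng = random.Random(seed)   # seeded so a form can be reproduced
    return rng.sample(items, n_items)


pool = generate_items()
print(len(pool))                 # 168 items from 4 shells
for item in sample_test(pool, n_items=5):
    print(item["stem"])
```

Because the pool is defined by explicit rules, different random forms can be treated as samples from the same domain, which is the source of the added generality discussed below.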



It is evident from these descriptions that as the
sophistication of the method improves, generality of the results and
the costs of test construction tend to increase. Further, the
particular method chosen will be influenced by the nature
of the efforts that have been devoted to the generation of
the objectives on which the CRTs are based and the
purposes for which they will be used. Finally, the degree of
sophistication may be limited still further by the clarity of
the domain to be assessed, such as mathematics versus
"citizenship," and the measurement technology available for
constructing measures in that domain (e.g., academic
achievement versus personality development).



Improving Item Quality

It is axiomatic that all tests and measures be field tested
prior to basing decisions upon them. Although it appears
that this axiom is often ignored, there are a number of
methods that have been suggested for analyzing CRT items
in order to identify those that are "faulty." It should be
noted, however, that an item that is considered "faulty" or
"good" using one method of analysis may not be identified
as such using another method (Popham & Husek, 1969).
This is illustrated in Table 1. It is apparent, therefore, that
the final version of a test may be influenced greatly by the
method of item analysis chosen for its construction (Cox &
Vargas, 1966; Roudabush, 1973).



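Table 1. Results of Different Item Analyses

Column headings: Item No.; Item Difficulty (Pretest, Posttest);
Possible point biserial r with score on test; Possible sensitivity
to instruction.

Legible entries, in source order (row alignment not recoverable):
50%; 100%; 50%; 0; .00. Possible sensitivity to instruction: High; Low.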



There are two basic concepts underlying the item analysis
techniques associated with CRTs and at least one of these
constructs is present in each analysis method. These two
constructs are as follows:

1. An item is considered "good" if it is sensitive to
instruction, that is, if performance on it is related to the degree of
instruction obtained. The methods that rely heavily on this
construct are usually used when there is little or no
variation in student scores at any one testing. There are
problems with such methods, however: they assume that
the instruction was indeed effective; they tend to produce
instructionally dependent measures; and they are biased by
maturation and other irrelevant systematic factors that
might tend to improve scores over time. Further, the use of
a technique emphasizing sensitivity could easily lead to
some rather interesting circular reasoning if one tried to
improve the test and an instructional program at the same
time.

2. An item is considered "good" if it discriminates between
those who did well versus those who did poorly on the test
as a whole or some "outside" criterion, such as performance
in the next step in a sequence of instruction. This
involves all the classical item analysis approaches and as
such one must accept all the assumptions, advantages, and
disadvantages that are normally associated with these
techniques (especially item and test variance).
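
The two constructs can lead to different verdicts on the same item. The following is a minimal sketch, not a procedure taken from the paper: it assumes a simple posttest-minus-pretest difference as the sensitivity index and a population-formula point biserial for the discrimination index. The function names and the five-examinee data are hypothetical.

```python
from math import sqrt

def difficulty(responses):
    """Proportion of examinees passing an item (1 = pass, 0 = fail)."""
    return sum(responses) / len(responses)

def sensitivity_index(pre, post):
    """Construct 1 (assumed form): posttest minus pretest proportion passing."""
    return difficulty(post) - difficulty(pre)

def point_biserial(item, totals):
    """Construct 2: correlation of a 0/1 item with total test score.
    Returns None when the item or the totals have no variance."""
    n = len(item)
    mean_i = sum(item) / n
    mean_t = sum(totals) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item, totals)) / n
    var_i = sum((i - mean_i) ** 2 for i in item) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    if var_i == 0 or var_t == 0:
        return None
    return cov / sqrt(var_i * var_t)

# Hypothetical data for five examinees.
post_totals = [7, 8, 9, 9, 10]

# Item A: failed by everyone before instruction, passed by everyone after.
item_a_pre, item_a_post = [0, 0, 0, 0, 0], [1, 1, 1, 1, 1]

# Item B: same difficulty before and after, but tracks total score.
item_b_pre, item_b_post = [0, 0, 1, 1, 1], [0, 0, 1, 1, 1]

a_sensitivity = sensitivity_index(item_a_pre, item_a_post)
a_discrimination = point_biserial(item_a_post, post_totals)
b_sensitivity = sensitivity_index(item_b_pre, item_b_post)
b_discrimination = point_biserial(item_b_post, post_totals)
```

Under these assumptions Item A is maximally sensitive to instruction yet has no computable discrimination on the posttest, while Item B discriminates well but shows no change with instruction.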

The kinds of analysis methods and their variations that
have been suggested are listed below:

1. Comparison Group. Give the test to two groups who are
known to possess different degrees of skill with respect to
the objective(s) measured. One way of doing this is to give
the test to those who have versus those who have not
received instruction dealing with the objective. A second
method is to give the test to those whose normal activities
require different levels of attainment of the skill measured
(e.g., carpenter versus auto mechanic for an objective
dealing with computing the size of various geometric
objects). The next basic step is to identify those items that
discriminate best between the groups in the desired direction
(that is, the presumably more able should do better). It
is important for the purposes of CRT interpretation that if
two separate groups are involved, they have the same
general intellectual ability or other characteristics that
might bias the test results.

2. Single Group, Pre- and Posttest. Give the test to the same
group twice, once before instruction and again after instruction.
Identify those items that discriminate between the
two test sessions. A number of item analysis techniques
designed specifically for CRTs have used this approach
(Popham, 1970; Ozenne, 1971; Kosecoff & Klein, 1973;
Roudabush, 1973).

3. Single Group, Posttest Only. Give the test to one group
of individuals after a fixed period of instruction, that is, all
examinees have had the same amount of opportunity to
achieve the objective. If the time allotted is somewhat less
than that needed for all the students to achieve the objective
and the students are somewhat heterogeneous in their
ability, as is common in most classrooms, then the typical
item analysis procedures such as computing point biserial
correlation coefficients may be employed to identify faulty
items. An internal criterion (total score on test) or an
external criterion (success in achieving a more advanced
skill) may be used (Glaser, 1963). One weakness in this
approach is that items having very high or low difficulties
will tend to have low biserial coefficients even though they
may be very sensitive to instruction. An extreme case
would be an item that would be failed by everyone prior to
instruction but passed by everyone after instruction. A
second weakness is that general intellectual ability as well as
the effects of instruction may influence the results and
there is no way of cleanly separating these influences.

4. Single Group, Repeated Measures. Each student periodically
takes the complete test until he is able to achieve
mastery. A record is kept of the number of times the
student passes and fails each item. Analysis is then made to
determine whether the item generally exhibits the desired
pattern of failure then success (with no reversals), i.e., a
desired pattern would be FFPP and an undesirable pattern
would be FPFP. This approach is only applicable where
there are no carry-over effects from test session to test
session or where truly parallel items may be constructed for
each test session and then systematically counterbalanced
across sessions and examinees. The advantages of this
approach are that it permits relevant scaling of an item
within an objective and the analysis is made after all
students have become "masters." The labor involved in this
approach and the likelihood of finding items that scale well,
however, have not contributed to this method's popularity.
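
The repeated measures check described in item 4 can be sketched in a few lines. This is an illustration only, assuming each student's history on one item is stored as a string of "F" and "P" outcomes in testing order. The function names and the sample records are hypothetical.

```python
def has_reversal(pattern):
    """True if a pass ("P") is ever followed by a later fail ("F")."""
    seen_pass = False
    for outcome in pattern:
        if outcome == "P":
            seen_pass = True
        elif outcome == "F" and seen_pass:
            return True
    return False

def proportion_clean(patterns):
    """Share of students whose record on one item shows failure then
    success with no reversals."""
    clean = sum(1 for p in patterns if not has_reversal(p))
    return clean / len(patterns)

# Hypothetical records of four students on a single item.
records = ["FFPP", "FPPP", "FPFP", "PFPP"]
clean_share = proportion_clean(records)
```

An item whose clean share is low across many students would be a candidate for review under this approach.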

One issue that is related to item analysis procedures and
that seems to be neglected with respect to CRTs is the
problem of knowing whether the final set of items provides
adequate coverage of the objective. In other words, how
many items are really needed to sample sufficiently a given
objective? Further, a procedure is needed for determining
whether some of the items are redundant. Although these
kinds of issues have been examined in part with the more
traditional kinds of tests, the unique demands of CRTs will
correspondingly require new ways of dealing with this
general problem of knowing when one has appropriate and
efficient coverage.

Content Validity 

A major concern of CRT developers is in establishing the 
content validity of their instruments. The three most 
common ways that have been used to do this are as follows: 

1. Systematic Test Development. This approach involves
presenting the rationale for the systematic procedure
employed in terms of why it should result in a content valid
test (Hively, 1970, 1973).

2. Expert Judgment. Content experts are given a variety of
objectives and the items used to measure them. They are
then asked to assign the items to their "appropriate" objective.
The degree to which they are able to do this accurately
reflects on item-objective consistency and thereby on
content validity; that is, is a given item really measuring the
objective for which it has been constructed? (Dahl, 1971).

3. Item Analysis. It is possible to compute internal consistency
indices for a CRT and/or see whether an item on a
given objective correlates more highly with other items for
this objective than it does with items on other objectives.
These approaches are limited by all the dangers of internal
consistency validation techniques plus the potential
problem of no variance on the measures (that is, the students
all receive the same score). The latter problem,
however, usually appears to be more theoretical than
actual, because students do vary in their performance. This
variation may be due to a number of factors including the
students' general intellectual ability, cultural and environmental
backgrounds, and the quality of instruction they
receive. If enough students are tested, then one will discover
sufficient variance in the levels of performance and/or
in the time it takes to achieve a given level. Reports of "no
variance" usually stem from failure to sample enough students
and/or from the failure to examine the rate at which
students master items and objectives. Thus, although one
might conceive of a situation in which no variance might
occur in a given classroom, it is hard to imagine how this
might arise across a variety of classrooms unless, of course,
the test was totally inappropriate for the full range of
examinees for whom it was constructed. The real problem,
therefore, is not in finding variance but in identifying just
that portion of the variance that is due to the student's
degree of mastery of the particular objective on which the
CRT is based rather than variance due to some extraneous
influence.



Item and Test Bias

"Item bias" may be defined as a group by item interaction;
that is, the profiles of performance of different
groups (such as males versus females) across all items in the
test are not parallel. "Test bias" is defined as a group by
test interaction; that is, groups do not have the same shaped
profile of scores across the various tests being considered
(Cleary, 1966; Cleary & Hilton, 1968). Little attention has
been paid to CRTs with respect to these kinds of biases,
although they have become important topics within the
general measurement field.

It should be noted, however, that the identification of a
test or set of tests as being "biased" with respect to certain
groups does not necessarily mean that the measures should
be revised. The reason for this is that such "bias" may only
mean that the educational and cultural experiences of the
groups taking the tests are systematically different and the
basis for these differences and how to deal with them
should be examined. It is entirely likely, for example, for a
test to appear biased simply because it draws more on the
vocabulary from certain texts than it does from others, and
the use of the more test-dependent texts is not random in
the population of examinees. Wider use of the more
dependent texts would, therefore, remove the supposed
"bias" in the test; changing the test to be more representative
of the texts used would also achieve the same result.
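
The idea of item bias as non-parallel group profiles can be illustrated descriptively. The sketch below is an assumption-laden illustration, not a method from the paper and not a significance test: it compares two groups' item difficulties and reports how far each item's between-group gap departs from the average gap. Parallel profiles give residuals of zero. All names and data are hypothetical.

```python
def item_difficulties(responses):
    """responses: list of examinee rows, each a list of 0/1 item scores.
    Returns the proportion passing each item."""
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(n_items)]

def interaction_residuals(group_a, group_b):
    """Per-item gap between the groups after removing the average gap.
    A large residual flags an item whose gap is out of line with the rest."""
    p_a = item_difficulties(group_a)
    p_b = item_difficulties(group_b)
    gaps = [a - b for a, b in zip(p_a, p_b)]
    mean_gap = sum(gaps) / len(gaps)
    return [g - mean_gap for g in gaps]

# Hypothetical data: four examinees per group, three items.
group_a = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [0, 0, 0]]
group_b = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]]

residuals = interaction_residuals(group_a, group_b)
flagged_item = max(range(len(residuals)), key=lambda j: abs(residuals[j]))
```

Here the third item (index 2) shows a gap well above the other two, so it is the one whose content would be examined first. As the text cautions, such a flag says nothing by itself about whether the item should be revised.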



Test Scores 

As noted in the preface to this paper, one of the two
essential features of a CRT is that an individual's or a
group's score on it is interpreted in terms of the level of
performance obtained with respect to the achievement of
the objective(s) on which the CRT is based. This type of
score reporting is contrasted to the norm referenced
approach in which a student's or a group's score is interpreted
with respect to the performance of other individuals
or groups (Popham & Husek, 1969). The primary advantage
of the CRT approach is, therefore, its ability to provide a
means for describing what the student (or group) can do or
what it knows or how it feels without having to consider
the skills, knowledge, or attitudes of others.

There is some question, however, as to whether a CRT
can really do this (Klein, 1970; Davis, 1971; Ebel, 1971).
For example, if parents are told that their child has
mastered a given objective or set of objectives, their first
question is "Is this performance satisfactory?" In other
words, they are asking whether the child is progressing
satisfactorily and the only frame of reference one can give
in this situation is the rate of progress of other students.
The fact that such a normative frame of reference can easily
be provided also points out that one can make norm
referenced interpretations of CRT scores. The distinctive
feature of a CRT score must, therefore, lie in its emphasis
on describing the absolute rather than the relative level of
performance with respect to an objective or skill. Because
of this emphasis, different kinds of scores are generally
reported for CRTs than for norm referenced measures.
Some of the different kinds of scores that can be reported
for CRTs that reflect emphasis on objectives are listed
below:

1. The number or percent correct on a given objective or set
of items that encompass a few highly related objectives.

2. "Mastery" of a given objective or set of items where
"mastery" is defined in terms of a certain level of performance
such as 90 percent correct.

3. The time it takes (such as in class hours or calendar days)
for an individual to achieve a given performance level
(including what has been defined as "mastery") (Harris,
1973).

4. The time (in minutes or hours) it takes a student to
perform a certain task or set of tasks related to an objective
(such as correctly computing the product of all single digit
numerals).

5. The probability that the student is ready to begin the
next level of instruction (this may be based on both the
number of items correct and the pattern of answers given to
these items).

6. The percentage of students who "pass" each item; that
is, the item's difficulty. This kind of score is used exclusively
in program evaluation where each item or task is
considered important in itself.
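
Three of the score types listed above (percent correct, mastery against a cutoff, and percentage of students passing each item) can be computed directly from a table of item scores. The following sketch is illustrative only. The 80 percent cutoff, the function names, and the class data are hypothetical and are not values recommended by the paper.

```python
def percent_correct(item_scores):
    """Score type 1: percent of items answered correctly on one objective."""
    return 100.0 * sum(item_scores) / len(item_scores)

def is_master(item_scores, cutoff_percent):
    """Score type 2: mastery defined as reaching a chosen percent correct."""
    return percent_correct(item_scores) >= cutoff_percent

def item_pass_rates(class_scores):
    """Score type 6: percentage of students passing each item."""
    n_students = len(class_scores)
    n_items = len(class_scores[0])
    return [100.0 * sum(row[j] for row in class_scores) / n_students
            for j in range(n_items)]

# Hypothetical class: four students, five items on one objective.
class_scores = [
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
]
CUTOFF = 80.0  # assumed mastery standard, for illustration only

student_percents = [percent_correct(row) for row in class_scores]
mastery_flags = [is_master(row, CUTOFF) for row in class_scores]
pass_rates = item_pass_rates(class_scores)
```

Reporting the percent correct alongside the mastery flag keeps visible how far each student is from the cutoff, which the flag alone would hide.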

Of all the scores listed above, the ones that have been the
focus of most discussion are those that imply the student
has achieved "mastery" (Millman, 1972). The reason
for this attention is that while such a score comes closest to
the underlying spirit of a CRT, there is rarely a good way of
defining exactly what is meant by "mastery." Arbitrary
definitions, such as 95 percent correct, are rampant; but
there is rarely any satisfactory criterion for setting such
standards of performance. Further, a mastery score often
hides the true level of student performance. In other words,
if the student failed to achieve mastery did he miss by a
little or miss by a great deal; or if he made it, did he just
squeak by? Finally, the problems inherent in the construction
of items for a CRT and especially those dealing
with the defining of the acceptable item types, item selection
procedures, and item difficulty severely limit the
interpretation of what is meant by "mastery."

Packaging and Other Considerations

How a CRT is finally put together and packaged is again a
function of the purpose(s) for which it will be used relative
to the various kinds of constraints imposed on its development
and use. When there is a vast number of objectives to
be assessed and it is not considered reasonable to develop a
separate CRT for each, one or more of the following
techniques are used:

1. Combine objectives that are considered highly related to
one another into a single measure.

2. Select a group of objectives from the total pool of objectives
based on a set of appropriate criteria (such as those
presented in the Appendix).

3. Limit the scope of each objective so as to reduce the
potential number of items and/or tasks that might be
needed to measure it.

All of these techniques do, of course, require the use of
experts in the fields of measurement and curriculum in
order to make sound compromises from both content and
methodological points of view.

The methods of packaging and distributing CRTs are
quite varied. One of the potentially most functional
techniques involves printing tests on spirit masters so that
each teacher can duplicate the copies needed for a given
class without having to purchase large numbers of test
booklets. A second innovation that appears to have promise
is referencing the objective and even the test item to
specific instructional materials. In one such case, the test
form was printed in such a way that the teacher was told
immediately whether the student passed the item, and in
the event of a failure a manual then directed teachers to
materials for additional instruction.



PRESENT EFFORTS IN CRITERION REFERENCED TESTING



This section of the paper summarizes the results of a survey
conducted to assess current efforts in criterion referenced
testing. All information is based on data provided directly
by the projects themselves or through associated technical
reports, journal articles, and interviews.

Although special emphasis was given to criterion referenced
measures in mathematics, related developmental
efforts in other content areas were also reviewed. The list of
projects reported here is not exhaustive* but can be viewed
as representative of the general state of the art in criterion
referenced testing.

Five defining characteristics of criterion referenced
testing programs have been identified. They include program
focus, instructional dependence, objective and item
generation, test models and packaging, and test scores. Each
of these characteristics has already been discussed in the
first section of this paper; however, some further explanation
regarding the scale used for the instructional
dependence category is needed.**

* The projects reported in this section are those that responded to
our survey. Projects were selected for the survey on the basis of an
extensive ERIC search and general knowledge of the field. It can be
expected, therefore, that some CRT testing efforts may have been
overlooked or that some programs did not respond.

California Test Bureau - McGraw-Hill (CTB)
Prescriptive Mathematics Inventory (PMI)

Focus. CTB is interested in the construction of CRT programs
for classroom management. In particular the PMI was
designed to measure 345 objectives representing the mathematics
curriculum nominally taught in grades four through
eight.

** A dichotomous classification is used to describe a criterion
referenced program's degree of instructional bias. Programs with a
large degree of instructional dependence develop test items that are
dependent on a particular curriculum or set of instructional materials
and techniques. Programs with a small degree of instructional
dependence, on the other hand, construct test items that are not
dependent on the specific skills or content of an instructional
program.






Instructional Dependency: Small. Neither the objectives
nor the test items reflect any instructional bias.

Objective and Item Generation. Using a "consensus
approach" objectives were culled from the text materials
most widely used in schools, collated from each source into
a single list, classified into broader objectives classifications,
and analyzed with respect to content and a hierarchical
structure. Items were then developed to measure these
objectives. (Note: On the PMI only one item is used to
assess each objective.)

Test Model and Packaging. The PMI is divided into four
levels based on the objectives most commonly taught in
grades 4 and 5, 5 and 6, 6 and 7, and 7 and 8. The test
items sample various levels of difficulty in each of the
content categories represented. In responding to the PMI,
the student records his answers on unique, item specific
machine-scoreable answer grids specially designed to
eliminate guessing.

In addition to the actual PMI test, CTB/McGraw-Hill
offers the following support materials and services:

• Complete scoring and reporting services (that provide 
information on objectives mastered and not yet 
mastered) 

• Practice exercise booklets, an examiner's manual and
class information sheet (to identify the class and tests)

• An Individual Diagnostic Matrix (reporting the student's
score on each objective)

• A Class Diagnostic Matrix (reporting average class scores
on each objective)

• An Individual Study Guide (that references pages in
texts where material can be found for objectives which
the student did not master)

• A Class Grouping Report (that lists students according
to their deficiencies in major content areas)

Test Scores. Because one item is used to measure each
behavior, the mastery criterion for each objective is that the
student correctly solve the associated item (Roudabush,
1971). Test scores are thus reported in terms of mastery or
non-mastery for each objective.

Four different types of reports are available for reporting
test scores: two individual reports for each student, and
two reports for the class. The Individual Diagnostic Matrix
shows a profile of the student's mastery or non-mastery of
the objectives. The Individual Study Guide gives page
references for a selected textbook covering those objectives
not yet mastered by the student. The Class Diagnostic
Matrix summarizes test results for the whole class in terms
of the percentage of students mastering each objective. And
finally, the Class Grouping Report indicates how students
fall into achievement groups within the mathematics curriculum
and provides page references to the textbook
used in the classroom for materials covering objectives that
were frequently missed.
Additional information available from:

CTB/McGraw-Hill
Del Monte Research Park
Monterey, California 93940

Comprehensive Achievement Monitoring (CAM)

Focus. CAM is designed as a computer-assisted, multipurpose
evaluation system useful at individual, group,
district, or state levels.

The CAM model is based on two attributes: (1) a flexible
time series design (testing at frequent intervals) which can
be varied to meet the financial limitations and information
needs of the user, and (2) a procedure for sampling students
and items which both introduces economy into testing and
increases the comprehensiveness of the samples available
from each testing session.

At present, the New York State Department of Education
has installed CAM or modifications of CAM in five school
districts. Although programs have mostly involved mathematics,
they are currently being extended to science and reading.

Instructional Dependency: Large. CAM is constructed to be
most effective when the items relate directly to course
objectives.

Objective and Item Generation. Curricula are defined by
behavioral objectives which are systematically coded for
easy identification, retrieval, and grouping, and by one or
more classifications. This process is typically carried out by
potential system users (that is, teacher groups).

With respect to objectives specification, a "behavioral
analysis" of course content requires that the user (1)
prepare a topical course outline, (2) specify the general
course objectives derived from the content (in non-behavioral
terms), (3) specify the terminal course objectives
(in behavioral terms), and (4) specify enabling objectives (in
behavioral terms). Objectives are then organized into classifications,
typically utilizing Ammerman and Melching's
(1966) classification system for the specificity of instructional
objectives by their relationship to terminal student
performance.

Items are developed by system users (teachers) directly
from objectives and are then judged (typically by the item
writers themselves) for their consistency with the objectives.
Considerations of error from guessing, ease of scoring,
criterion referenced versus norm referenced test interpretations,
and general item writing skills (that is, "the item stem
must be worded to require specific response") guide item
construction activities.






Test Model and Packaging. The typical set of CAM tests is
constructed around the stated objectives of the course or
program to be evaluated. Objectives, items, and test forms
are typically generated by system users in accordance with
instructions provided in a user's manual.

Generally, a pool of items is constructed with approximately
4 to 10 samples per objective. Through random
stratified sampling items are assigned to test forms creating
parallel test forms or monitors. Students receive the test
forms in a random order at fixed testing intervals (determined
by the user's information needs). Each test form
contains a fixed number of items representing objectives
which are taught between test administrations. Test forms
are usually short, requiring from 10 to [illegible] minutes of
testing time.
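
The assignment of items to parallel forms by stratified random sampling can be sketched as follows. This is a simplified illustration and not CAM's actual procedure: it assumes the strata are the objectives, deals each objective's shuffled items across the forms in turn, and ignores the scheduling of objectives between administrations. The function name, item identifiers, and pool are hypothetical.

```python
import random

def build_parallel_forms(item_pool, n_forms, seed=None):
    """item_pool: dict mapping objective -> list of item ids.
    Shuffles the items within each objective, then deals them across the
    forms so that every form draws on every objective."""
    rng = random.Random(seed)
    forms = [[] for _ in range(n_forms)]
    for objective in sorted(item_pool):
        items = list(item_pool[objective])
        rng.shuffle(items)
        for k, item in enumerate(items):
            forms[k % n_forms].append(item)
    return forms

# Hypothetical pool: three objectives with four items each.
pool = {
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["B1", "B2", "B3", "B4"],
    "C": ["C1", "C2", "C3", "C4"],
}
forms = build_parallel_forms(pool, n_forms=4, seed=1)

# Each student could then receive the forms in a random order.
order_rng = random.Random(2)
student_order = list(range(len(forms)))
order_rng.shuffle(student_order)
```

With four items per objective and four forms, each form ends up with exactly one item per objective, which is what makes the forms comparable in content.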

Test Scores. Through sampling of test items and testing at
frequent intervals, CAM generates performance data on all
course objectives in relation to three phases of time: before
instruction, immediately after instruction, and retention
over long periods of time.

After each test administration each student receives a
report listing the correct and incorrect responses to every
item as well as total scores on current and previous tests.
Group data are also provided in the form of percent
achievement by designated objectives for each test administration.
Finally, achievement profiles which graphically
display the level of achievement (in terms of percent
correct test scores) for all previous and current tests on
selected objectives are available quarterly.

Additional information available from:

Robert O'Reilly
Chief, Bureau of School and Cultural Research
University of the State of New York
State Education Department
Albany, New York 12224

William Gorth
School of Education
University of Massachusetts
Amherst, Massachusetts 01002

Individualized Criterion Referenced Testing (ICRT) 
Focus. ICRT offers criterion referenced testing programs
emphasizing individual student achievement and providing
two basic kinds of information: first, the specific
knowledge and skills which the student has learned, and
second, the specific knowledge and skills which are the next
instructional steps to be mastered. At present such testing
programs are available in reading and mathematics; the
following comments will focus primarily on the mathematics
system.

Instructional Dependency: Large. The basis for the criterion
referenced tests is a set of specified instructional objectives
which describe the Continuous Progress Laboratory
Math program.

Objective and Item Generation: Instructional objectives
referenced to the Continuous Progress Laboratory's math
curriculum are arranged from the most elementary to the
most difficult, forming an instructional continuum. From
this instructional continuum those objectives common to
most curricula and expected of most students are selected
as testing objectives. These selected objectives, arranged
with respect to item difficulty, constitute a testing
continuum. The testing continuum is then used as a basis
for item and test generation.

Test Models and Packaging: ICRT provides test kits for
each grade level 1-8. Each test kit has sufficient tests for an
average class, a Teacher's Manual, a scoring template and an
orientation kit. In addition, each kit (with the exception of
level 1) contains multiple copies of the grade level test
booklets as well as multiple copies of booklets for up to
two levels below the indicated grade level of the kit.

Tests are designed to be self-administered or administered
with teacher guidance. All the tests are power tests with no
implied time limit. Each test has approximately 16 items (2
items per objective). The student records his responses to
the test items on computer cards. Directions for test scoring
are included in the teacher's guide.

Four kinds of score reports are available: a District Summary,
a Building Summary, a Class Summary, and a
Student Summary. The Student Summary provides prescriptive
instructional resources.

Test Scores: Students' scores on each objective are reported.
In the District, Building, and Class Summaries, students' scores
are reported in terms of how many students are at various
working levels (a student's approximate working level is
determined by the first test booklet in which he or she
missed 3 or more objectives). The Student Summary is
intended as a prescriptive instrument, indicating which
objectives have been mastered, which require review, and
which should be learned next. In addition, prescriptive
instructional resources are suggested for objectives which
the student needs to review or learn. These prescriptive
guides are referenced to the Continuous Progress Laboratory
Math Program, the supplementary drill tapes, and
three additional curricula selected by the user.
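
The working level rule stated above (the first test booklet in which the student missed 3 or more objectives) and the summary counts built from it can be sketched as follows. This is an illustration under stated assumptions, not ICRT's scoring software: booklet results are assumed to be stored lowest level first, and the function names and student data are hypothetical.

```python
def working_level(booklet_results, miss_threshold=3):
    """booklet_results: list of (level, objectives_missed) pairs, lowest
    level first. Returns the first level at which the student missed
    miss_threshold or more objectives, or None if no booklet qualifies."""
    for level, missed in booklet_results:
        if missed >= miss_threshold:
            return level
    return None

def level_counts(students):
    """Counts how many students fall at each working level, as a class,
    building, or district summary might report."""
    counts = {}
    for results in students.values():
        level = working_level(results)
        counts[level] = counts.get(level, 0) + 1
    return counts

# Hypothetical class of three students.
students = {
    "student_1": [(2, 0), (3, 1), (4, 3), (5, 6)],
    "student_2": [(2, 4), (3, 5)],
    "student_3": [(3, 0), (4, 2), (5, 3)],
}
summary = level_counts(students)
```

A student who never misses three objectives in any booklet taken is reported here as None, which a real summary would need to handle explicitly.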
Additional information available from:

Louis Miller, Vice President
Educational Progress Corporation
3000 Sand Hill Road
Menlo Park, California 94025

Charles Carlson
Educational Progress Corporation
4900 South Lewis Avenue
Tulsa, Oklahoma 74105




Instructional Objectives Exchange (IOX)

Focus. A criterion referenced test program has been
developed to complement the IOX objectives collections.
The decision to develop these objectives-based tests
represents an effort to provide readily usable support materials
to assess individual student progress and to facilitate
classroom management.

Instructional Dependency: Small. Neither the objectives
booklets nor the criterion referenced tests are based on any
particular curriculum or instructional program.

Objective and Item Generation. Within each subject area
objectives are defined in terms of relevant topics and skills
at three levels of generality. Criteria for sampling the most
general categories include importance of the area, economy
of production into tests, and practical scoreability. Selection
of the type of learner behavior to serve as the specific
objective is then guided by considerations of transferability
or generalizability within a content area, importance,
terminality (that is, the highest step in a hierarchy), transferability
outside the area, ease in scoring, and amenability
to instruction.

Rooted in Wells Hively's (1970, 1973) item form analysis,
expanded objectives (called amplified objectives) are used
to define permissible stimulus and response options for
item generation. For each objective only one type of test
item is used; the associated item format is then carefully
defined by an amplified objective.

Test Models and Packaging. IOX provides manuals listing
objectives, sets of criterion referenced tests, and a user's
guide or test manual. In the area of elementary mathematics,
for example, there are five independent sets of criterion
referenced measures which cover the nine mathematics
strands identified by the California State Department of
Education. For each set of tests a parallel set is available to
facilitate pre- and posttesting (that is, each set of tests is
available in a form A and a form B which contain parallel
tests).

Tests are distributed on one page, preprinted spirit
masters which can be used by teachers to duplicate sufficient
copies for their students. The typical test is multiple
choice in format, contains 5 to 10 items and requires about
30 minutes to complete. The test manual provides a list of
objectives in that area, sample test items, complete instructions
for test administration, answer keys, and a guide for
classifying scores in terms of achievement levels (whether or
not the student attained mastery).

Test Scores. Although directions are provided in a user's
guide for classifying scores into mastery groups, IOX does
not provide forms for reporting scores or suggestions for
tabulating test scores.



Additional information available from:

Instructional Objectives Exchange
Box 24095
Los Angeles, California 90024

MINNEMAST Curriculum Project - University of Minnesota. 

Focus. The MINNEMAST Project represents an experimental
effort to develop a coordinated and sequential
mathematics and science curriculum for the elementary
school. As part of the evaluation of this project, a technology
for criterion referenced test construction was
developed by Hively and his associates at the University of
Minnesota. These tests were primarily intended to assess the
MINNEMAST Program itself rather than individual students'
progress.

Instructional Dependency: Small. Test items were generated
that reflect the entire range of skills and behaviors
associated with a given objective.

Objective and Item Generation. Relevant learner behaviors
and skills associated with a given content area were
organized (by the MINNEMAST staff) into classes called
learning domains. The basic notion underlying this process
is that important classes of content and skill would be
completely defined in terms of behaviorally stated, structured
sets or domains.

Rules for generating test items for a given learning
domain are organized into formal schemes called item
forms. There are three major components to an item form:
(1) instructions (directions given to students), (2) stimulus
characteristics (the skills and behaviors an item can cover
and rules for constructing specific kinds of items), and (3)
response characteristics (acceptable ways of responding to
an item, for example, written or oral responses).
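
The three components of an item form can be pictured as a small data structure. The sketch below is only an illustration: the class name ItemForm, the single-digit addition rule, and the wording of the instructions are all hypothetical and are not taken from the MINNEMAST materials.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ItemForm:
    """Hypothetical container for the three components of an item form."""
    instructions: str   # (1) directions given to students
    make_stimulus: Callable[[random.Random], Tuple[str, str]]  # (2) rule for building one item
    response_mode: str  # (3) acceptable way of responding

    def generate(self, n: int, seed: int = 0) -> List[Tuple[str, str]]:
        """Return n (stimulus, keyed answer) pairs produced by the stimulus rule."""
        rng = random.Random(seed)
        return [self.make_stimulus(rng) for _ in range(n)]


def single_digit_addition(rng: random.Random) -> Tuple[str, str]:
    """Assumed stimulus rule: add two whole numbers between 0 and 9."""
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    return f"{a} + {b} = ?", str(a + b)


addition_form = ItemForm(
    instructions="Write the sum.",
    make_stimulus=single_digit_addition,
    response_mode="written",
)
items = addition_form.generate(5)
```

Because every item comes from the same rule, any sample of items drawn from the form represents the same class of behavior.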

Test Models and Packaging. (It should be noted that the
MINNEMAST efforts reported here were field test activities,
and consequently a final packaging mode was not
available.)

The MINNEMAST curriculum was divided into discrete
units. For each unit the teacher was provided with a
handbook containing a sequence of lessons, general statements
about goals, explanatory background information, and lists
of materials needed for lessons.

Test construction was computer-assisted and conducted
by the MINNEMAST staff. A system of student-item
sampling was utilized to gather information on all test items
with a minimum of testing time. To this end computer
printout labels were generated for each student listing his or
her name, identifying data such as class and school, and the
items assigned to him or her. When all the items specified
from an item form had been written, the computer printout
labels were attached to them and the items were then
collated into tests for the individual students.
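
Student-item sampling of this kind can be sketched as follows. The report does not describe the actual MINNEMAST assignment procedure, so the cyclic dealing rule, the function names, and the label layout below are assumptions chosen only to show how every item can be covered while each student answers a few.

```python
import random


def assign_items(students, items, per_student, seed=0):
    """Deal items to students cyclically from one shuffled pool (assumed rule)."""
    if per_student > len(items):
        raise ValueError("per_student cannot exceed the number of items")
    rng = random.Random(seed)
    pool = list(items)
    rng.shuffle(pool)
    assignment = {}
    position = 0
    for student in students:
        # Take the next per_student items, wrapping around the pool.
        assignment[student] = [
            pool[(position + k) % len(pool)] for k in range(per_student)
        ]
        position += per_student
    return assignment


def make_label(name, school_class, school, assigned):
    """Hypothetical label text: name, identifying data, and assigned items."""
    return f"{name} | {school_class} | {school} | items: {', '.join(assigned)}"


students = [f"student{n}" for n in range(1, 7)]
items = [f"item{n:02d}" for n in range(1, 13)]
assignment = assign_items(students, items, per_student=4)
labels = [make_label(s, "class A", "school 1", assignment[s]) for s in students]
```

With 6 students, 12 items, and 4 items per student, each item is answered by exactly two students, so data are gathered on all items although no student sees more than a third of them.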

Test Scores. The principal data derived from the
MINNEMAST testing program were the proportion of
correct responses. Whenever possible, however, additional
information was reported concerning the kinds of correct
and incorrect responses being made.

Although no set format for reporting scores was
stipulated, data were usually presented in tables showing
complete item-by-item listings of actual responses as well as
frequencies of various categories of responses (for example,
frequencies for individual items, item forms or objectives,
and groups of objectives). Due to the absence of empirical
evidence, desired levels of achievement were not established
in advance of testing.
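
The two kinds of data described above, proportions of correct responses and frequencies of response categories, can be tabulated with a short routine. This is a sketch under assumed conventions: the record layout (item, response category, correct or not) and the category names are hypothetical.

```python
from collections import Counter, defaultdict


def summarize(records):
    """Return (proportion correct per item, frequency of each response category)."""
    attempts = defaultdict(int)
    correct = defaultdict(int)
    categories = Counter()
    for item_id, category, is_correct in records:
        attempts[item_id] += 1
        correct[item_id] += int(is_correct)
        categories[category] += 1
    proportions = {item: correct[item] / attempts[item] for item in attempts}
    return proportions, categories


# Hypothetical records: (item, kind of response, whether it was correct)
records = [
    ("item01", "correct", True),
    ("item01", "carrying error", False),
    ("item02", "correct", True),
    ("item02", "correct", True),
]
proportion_correct, category_counts = summarize(records)
```

The same tabulation could be run over item forms or groups of objectives by replacing the item identifier with the identifier of the larger unit.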

Additional information available from:

Wells Hively
Department of Psychology
University of Minnesota
Minneapolis, Minnesota 55455

National Assessment of Educational Progress (NAEP) 

Focus. The purpose of NAEP is the assessment of
educational attainments on a national basis.

Instructional Dependency: Small. Neither objectives nor
items are referenced to any curriculum text or instructional
program.

Objective and Item Generation. NAEP defines its objectives
and the associated skills and behaviors (the "domain of
reference") through a national consensus of opinion
regarding the important goals and outcomes of education
with respect to a given subject area.

Objectives developed by NAEP's Exercise Development
Department are reviewed by external subject matter experts
and layman groups. Following the development of
objectives, contracts are awarded for item generation. The
number of items developed for a given objective is based on
a weighting scheme determined by the subject matter
experts. A framework for item writing is provided by a
system of exercise prototypes that define four
characteristics of an item: (1) administrative mode (can the
item be administered individually or to a group), (2) stimulus
mode (audio, visual, and so on), (3) response mode (multiple
choice or free response), and (4) response category
(written, verbal, role playing, and so on).

Test Models and Packaging. Tests are designed exclusively
for measuring student achievement on a national scale. The
number of items for a given objective is determined by a
weighting scale based on priorities identified by the subject
matter experts. Tests are available at four age levels (9, 13,
17, and adult). Two subject areas are currently being
assessed each year with a five-year reassessment cycle.
(Mathematics is scheduled for the 72-73 school year.) Two
hundred and ten minutes of testing are allotted annually to
each subject at each age level.
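
How a weighting scale might translate into numbers of items can be shown with a small allocation routine. The report does not say how NAEP converts weights into item counts, so the largest-remainder rule, the objective names, and the weights below are assumptions for illustration only.

```python
def allocate_items(weights, total_items):
    """Split total_items across objectives in proportion to their weights.

    Assumed rule: give each objective the whole-number part of its share,
    then hand out any leftover items to the largest fractional remainders.
    """
    weight_sum = sum(weights.values())
    quotas = {k: total_items * w / weight_sum for k, w in weights.items()}
    counts = {k: int(q) for k, q in quotas.items()}
    leftover = total_items - sum(counts.values())
    by_fraction = sorted(weights, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts


# Hypothetical priorities assigned by subject matter experts
weights = {"objective A": 3, "objective B": 2, "objective C": 1}
allocation = allocate_items(weights, total_items=10)
```

For these weights and ten items the shares are 5, 3.33, and 1.67, so the one leftover item goes to objective C, giving 5, 3, and 2 items.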

Test Scores. Scores are generally reported as the percentage
of correct responses by items and for various classes of
items. For example, items dealing with solving algebraic
equations might be compared with items on mathematical
induction. In addition, scores are broken down in terms of
typical performance by region, sex, SES, and so on.

Additional information available from:

National Assessment of Educational Progress
822 Lincoln Tower
1860 Lincoln Street
Denver, Colorado 80203

Southwest Regional Laboratory (SWRL) 

Focus. SWRL is involved in the development of
text-referenced instructional management systems that operate
in conjunction with a developed curriculum. At present
such a classroom management system in reading is available
at the kindergarten level and a math system is under
development. Criterion referenced tests have been incorporated
into this system to assess student progress.

Instructional Dependency: Large. The SWRL program is
specifically based on a predefined curriculum: ". . . to be
minimally useful the (CR) test must be specifically
referenced to a prespecified structure of achievement. To
be maximally useful the tests must be specifically
referenced to defined instructional materials" (Baker, R.L.,
1972).

Objective and Item Generation. Hively's (1973) item form
approach and related processes are utilized to define classes
of behaviors and skills associated with specific content
areas. A collection of item forms sequentially organized
together with a list of constraints on item generation
provide the framework for defining total content areas in
behavioral terms (a "universe of content"). Strings of item
forms are then organized into tentative sequences of
"instructional specifications" that map out the
instructional and evaluation efforts consistent with the item
forms.

Test Models and Packaging. With respect to evaluation 
activities each instructional management system provides: 

• A means (vis-à-vis testing) for student placement

• Criterion referenced measures on ^ to 8 instructional
  outcomes 10 to 15 times during the year. (Note: These
  tests are constructed for specific information purposes,
  to assess student progress on objectives attended to by a
  specific curriculum.)

• Additional practice materials for the instructional
  outcomes which have continuity throughout the text

• A mid-year and end-of-year evaluation measure

• A Quality Assurance System (a user's manual providing
  directions and pacing information)



Test Scores. The Quality Assurance Manual provides forms
for reporting the mean, standard deviation, and percentage of
students attaining criterion performance. Regression
analyses between criterion scores on final and mid-year
criterion referenced tests are also reported based on a large
student sample.
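
The three summary figures named above can be computed from a list of class scores as sketched here. The scores, the criterion of 8, and the use of the population standard deviation are assumptions; the report does not give SWRL's scoring conventions.

```python
import statistics


def summarize_scores(scores, criterion):
    """Return mean, standard deviation, and percent of students at or above criterion."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population formula, assumed for illustration
    attained = sum(1 for s in scores if s >= criterion)
    percent_attaining = 100.0 * attained / len(scores)
    return mean, sd, percent_attaining


# Hypothetical scores on a 10-item criterion referenced measure
scores = [8, 9, 10, 6, 7]
mean, sd, percent_attaining = summarize_scores(scores, criterion=8)
```

For these five scores the mean is 8, and three of the five students (60 percent) reach the assumed criterion.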

Additional information available from:

Southwest Regional Laboratory for
Educational Research and Development
4665 Lampson Avenue
Los Alamitos, California 90720



System for Objectives Based Assessment - Reading (SOBAR)
Center for the Study of Evaluation, UCLA

Focus. SOBAR basically constitutes an item bank
integrated into a selection/delivery system intended as a
multipurpose evaluation procedure appropriate at the individual,
group, or program level. Designed to serve as an exemplary
objectives based assessment system, SOBAR includes a set
of performance objectives covering the entire spectrum of a
content area (in this case that of reading, grades K-12), a
classification system for selecting objectives, and a bank of
assessment items keyed to specific objectives.

Instructional Dependency: Small. SOBAR is seen as a
flexible, multipurpose test generation system that is not
dependent on a given instructional program or information
need.

Objective and Item Generation. A set of objectives was
developed by the SOBAR staff (with the help of reading
experts) to cover the complete content area of reading.
These objectives were then classified into categories
reflecting various skill areas and levels of generality. Upon
completion of objective specification, the SOBAR staff
constructed items keyed to the objectives. During item writing
special attention was given to independence
(non-redundancy) of items, objective-item congruence, and the
comprehensiveness of items. The system is referenced to
performance objectives.



Test Models and Packaging. Among the materials and
services provided to users SOBAR includes:

• A comprehensive catalogue of nearly 500 objectives.
  (These objectives cover grades K-12 and are divided into
  six major skill categories.)

• A guide and selection format to aid the user in selecting
  objectives appropriate to local priorities

• Computer generated reports of the outcome of the
  objective selection process

• Tests for each SOBAR objective. These measures are
  leveled by grade clusters: K-3, 4-6, 7-9, and 10-12.
  Depending on the nature of the objective a test for an
  individual objective may contain 5-20 test items.

Test construction is viewed in terms of the user's specific
information needs. Items are selected for tests according to
the test model appropriate for a given test situation. In
addition, tests can be assembled at different levels of
objective generality.
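
Assembling a test from an item bank keyed to objectives can be sketched as below. The objective codes, item identifiers, and the rule of drawing a fixed number of items per selected objective are hypothetical; SOBAR's actual selection and delivery procedures are not described in enough detail here to reproduce.

```python
import random

# Hypothetical bank: each objective code is keyed to the items that measure it.
ITEM_BANK = {
    "K-3/word-recognition": ["wr1", "wr2", "wr3", "wr4", "wr5", "wr6"],
    "K-3/phonics": ["ph1", "ph2", "ph3", "ph4", "ph5"],
    "4-6/comprehension": ["co1", "co2", "co3", "co4", "co5", "co6", "co7"],
}


def assemble_test(bank, selected_objectives, items_per_objective, seed=0):
    """Draw items for each selected objective; return (objective, item) pairs."""
    rng = random.Random(seed)
    test = []
    for objective in selected_objectives:
        pool = bank[objective]
        k = min(items_per_objective, len(pool))
        test.extend((objective, item) for item in rng.sample(pool, k))
    return test


# A user interested only in the K-3 cluster selects two objectives.
chosen = [code for code in ITEM_BANK if code.startswith("K-3/")]
test_form = assemble_test(ITEM_BANK, chosen, items_per_objective=5)
```

Changing the list of selected objectives or the number of items per objective yields a different test from the same bank, which is the sense in which such a system is multipurpose.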

Test Scores. At present SOBAR has not begun to field test
methods of score reporting and interpretation.

Additional information available from:

SOBAR Project
Center for the Study of Evaluation
University of California
145 Moore Hall
Los Angeles, California 90024



Zweig and Associates

Focus. Zweig offers a criterion referenced testing program
based on behavioral objectives and indexed to prescriptions
for teaching alternatives. At present such testing programs
designed for classroom management within the context of
individualized instruction are available in reading and
mathematics. The Fountain Valley Teacher Support System
in mathematics was reviewed for this paper. Comments are
largely based on the Fountain Valley System.

Instructional Dependency: Small. Objectives and items
cover the entire spectrum of skills reflected in the nine
mathematics strands for California.

Objective and Item Generation. Objectives and items are
generated by teacher groups (followed by a review from
experts) and reflect skills in each of the nine mathematical
strands developed by the California State Department of
Education. Strands are measured at each grade level, K-8,
for which there is pertinent instruction. Typically 3 to 5
items are constructed for each objective.



Test Models and Packaging. The Fountain Valley System
includes:

• 785 objectives organized by strand and grade level

• 1 self-scoring, self-administering tests

• Continuous Pupil Progress Profiles to record individual
  student achievement

• Class ditto masters to document group performance

• Teacher Manuals for each grade level (that include a
  listing of all objectives at that level)

• Manuals of criterion referenced teaching alternatives

All materials are color coded. Tape cassettes at each level
provide directions for test administration. Each test is
printed on a sealed form made of treated paper that
automatically records the student's responses on the reverse
side of the test sheet as the student takes the test. In
addition, the reverse side indicates correct or incorrect
responses by a number code which corresponds to the
objective and strand being tested and provides a score
interpretation key to classify scores into "proceed" and
"reteach" categories.

At each grade level a Teaching Alternatives Manual
documents (by number code) prescriptive activities (for skills
falling into the reteach category) listed by number code
under each publisher's name and series.

Test Scores. Student scores in each skill for each of the nine
strands are recorded on a Continuous Pupil Progress Profile
(CPPP). The objectives for each strand are arranged on the
CPPP in a hierarchy of difficulty, grouped by grade levels,
and designated by color and number codes. Objectives
measured by each test are then grouped between heavy
lines. Student scores are recorded on the CPPP as either
reteach or proceed in accordance with scoring instructions
on the answer sheet. These instructions give the number of
incorrect answers that determine the classification for each
skill.
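
The proceed/reteach decision can be written as a simple rule on the number of incorrect answers. The cutoff values and skill names below are hypothetical, and the assumption that a count at or above the cutoff means "reteach" is ours; the actual cutoffs are given in the scoring instructions for each test.

```python
def classify(num_incorrect, reteach_at):
    """Return 'reteach' when incorrect answers reach the cutoff, else 'proceed'."""
    if num_incorrect < 0:
        raise ValueError("num_incorrect cannot be negative")
    return "reteach" if num_incorrect >= reteach_at else "proceed"


def profile_entries(incorrect_by_skill, cutoff_by_skill):
    """Build the entries that would be recorded on a progress profile."""
    return {
        skill: classify(incorrect_by_skill[skill], cutoff_by_skill[skill])
        for skill in incorrect_by_skill
    }


# Hypothetical skills, each measured by a few items
cutoffs = {"place value": 2, "addition facts": 2, "telling time": 1}
incorrect = {"place value": 0, "addition facts": 2, "telling time": 1}
entries = profile_entries(incorrect, cutoffs)
```

Here only "place value" is marked proceed; the other two skills reach their assumed cutoffs and would be referred to the teaching alternatives for reteaching.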

Additional information available from:

Richard L. Zweig Associates, Inc.
20800 Beach Boulevard
Huntington Beach, California 92648

Summary and Conclusions 

This paper has attempted to outline the basic steps and
procedures in the development of criterion referenced tests
as well as the issues and problems associated with these
activities. In addition, representative CRT systems have
been reviewed. From this analysis it is clear that the
developer of a CRT must answer a number of questions in
order to clarify the nature and purpose of a CRT.

1. For what decision areas and purposes is the CRT most
   applicable?

2. What areas and objectives does the CRT cover and how
   were these objectives derived and organized?

3. How broadly or narrowly are the objectives defined?

4. How were the test items or tasks chosen to measure the
   objectives defined and developed?

5. How dependent are the items on particular instructional
   materials or programs? And what is their
   applicability to different kinds of students?

6. What methods were used to improve the items on the
   CRT and why were they chosen relative to the purpose
   of the instrument?

7. How was the validity of the CRT established?

8. What kinds of scores should be reported for a CRT and
   what is the justification for these scores, especially
   those involving "mastery"?

9. How was the test finally put together, what compromises
   had to be made, and how were they resolved?

10. In what ways will packaging of the CRT facilitate its
    use?



These questions will hopefully serve three functions. The
first is that they will guide CRT developers to the issues
that must be addressed both in the construction process
and in the manual that accompanies the final instrument.
The second purpose these questions may serve is to guide
researchers to those problems of major interest within the
field of criterion referenced testing. Finally, they will help
the purchasers of CRTs to understand better the kinds of
variables they must consider in order to make a wise
selection of instruments and an appropriate interpretation of the
results obtained with them. Certainly the publication of a
set of minimum standards for CRTs by an appropriate
professional organization would go a long way toward ensuring
that these functions have been carried out successfully.






APPENDIX: IOX Criteria for Selecting Objectives*



The following criteria should be applied in deciding on the
type of learner behavior which will serve as the specific
objective, thereafter to guide the test construction:

(1) Transferability Within Domain. The form of learner
behavior selected should be the most generalizable of those
represented in the content general domain, i.e., a learner
mastering the designated behavior requirements would
likely be able to transfer that mastery to most, if not all, of
the other eligible behavioral requirements in the content
general domain.

In making such a selection it is important to consider the
entire range of learner behaviors with which we are
concerned, i.e., both test-like events and real world events. For
instance, in surveying an individual's mathematical
competence we should be attentive not only to the X1, X2,
and X3, which we can represent via standard test formats
but to the X98, X99, and X100, which might reflect such
skills as the ability to make change in a supermarket or to
complete one's annual income tax report.

The test constructor should sketch out as wide a range of
alternatives as possible, then select the one testable learner
behavior which will most readily transfer to the other
learner behaviors delimited by the content general
objective.

(2) Widely Accepted. The objective selected should be the
most widely accepted as important by those in the field.
Unlike the IOX objective collections where we present a
wide array of alternatives and then encourage educators to
choose among them, here we will have to go with the
majority preference. Clearly, this criterion is not unrelated
to criterion number one, but it may be profitable to apply
it independently.

(3) Terminality. If there is a degree of possible hierarchy
present in the contending types of learner behaviors under
consideration, such that some are considered precursive or
en route to others, the chosen specific objective should
represent the most terminal learner behavior.



(4) Transferability Outside the Domain. Another
consideration in selecting a specific objective is the degree to
which that behavior, once mastered, will be transferable
outside the content general domain, for example, to
domains which might be learned by students in the future.
For instance, certain skills acquired by students in one
course (such as the ability to distinguish between fact and
opinion) may have reference to many other courses. Such
high transfer skills and intellectual constructs should be
given high priority in the selection of specific objectives.

(5) Ease of Scorability. In an effort to produce tests which
have considerable practical utility, we must try to select
learner behaviors which, other factors being equal, can be
easily scored by those educators employing them. Again,
this does not limit us to selected response items, for in
some instances we shall surely find it necessary to utilize
constructed response formats. (This may help distinguish
the IOX tests from typical standardized tests.) Nevertheless,
scoring practicality is a nontrivial consideration.

Now how should these five criteria be employed in
selecting the specific objectives? Should they be weighted
equally, in descending order, or in reverse order (stratified
according to number of two syllable words in the
descriptive paragraphs)? Sorry, but no handy scheme is available
for mechanical translation into decisions. Test constructors
must, however, be self-consciously attentive to each of the
five points. We may devise a check sheet or other shorthand
form to encourage such attention. If the test developer has
exhausted all rational alternatives, an arbitrary selection

Having chosen the specific objectives, that is, the
categories to be used in generating a pool of homogeneous test
items which assess a given learner behavior, the next task
involves the production of a defensible set of such items.

*Excerpted with permission of the author, W.J. Popham, from
Selecting Objectives and Generating Test Items for Objectives-Based
Tests. Los Angeles, IOX, 1972.






Airasian, P., & Madaus, G. Criterion referenced testing in
the classroom. Measurement in Education, 1972, 3 (4).

Baker, R.L. Using measurement to improve instruction.
Paper presented at Convention of American Psychological
Association, Honolulu, Hawaii, 1972. ED 069

Baker, R.L. Measurement considerations in instruction
product development. Paper presented at Conference on
Problems in Objectives Based Measurement, Center for
the Study of Evaluation, University of California,

Bormuth, J.R. On the theory of achievement test items.
Chicago: University of Chicago Press, 1970.

Cleary, T. Test bias: Validity of the scholastic aptitude test
for negro and white students in integrated colleges.
Research Bulletin. Princeton, New Jersey: Educational
Testing Service, 1966. ED 018 200.

Cleary, T., & Hilton, T. An investigation of item bias.
Educational and Psychological Measurement, 1968, 28 (1),
61-75.

Cox, R., & Vargas, J.C. A comparison of item selection
techniques for norm referenced and criterion referenced
tests. Pittsburgh: Center for the Study of Instructional
Programs, Learning Research and Development Center,
University of Pittsburgh, 1966.

Cronbach, L.J. Test validation. In R.L. Thorndike (Ed.),
Educational Measurement (2nd ed.). Washington, D.C.:
American Council on Education, 1971.

Dahl, T.A. The measurement of congruence between
learning objectives and test items. Unpublished doctoral
dissertation, University of California, Los Angeles, 1971.

Davis, F.B. Criterion referenced tests. Paper presented at
Annual AERA Meeting, New York, 1971. ED 050 154.

Davis, F.B. Criterion referenced measurement. 1971 AERA
Conference Summaries, ERIC/TM Report 12, 1972.
Princeton, New Jersey: ERIC Clearinghouse on Tests,
Measurement, and Evaluation, 1972. ED 060 LU.

Davis, F.B. Criterion referenced measurement. 1972 AERA
Conference Summaries, ERIC/TM Report 17, 1973.
Princeton, New Jersey: ERIC Clearinghouse on Tests,
Measurement, and Evaluation, 1973. ED 073 143.

Ebel, R.L. Evaluation and educational objectives:
Behavioral and otherwise. Paper presented at the Convention of
the American Psychological Association, Honolulu,
Hawaii, 1972.

Glaser, R. Instructional technology and the measurement of
learning outcomes: Some questions. American
Psychologist, 1963, 18, 519-521.

Glaser, R., & Nitko, A. Measurement in learning and
instruction. In R.L. Thorndike (Ed.), Educational
Measurement (2nd ed.). Washington, D.C.: American Council
on Education, 1971. Pp. 652-670.

Harris, C. Comments on problems of objectives based
measurement. Paper presented at Annual AERA Meeting, New
Orleans, 1973.

Hively, W. Introduction to domain referenced achievement 
testing. Symposium presentation. AERA, Minnesota. 
1970. 

Hively, W., Maxwell, G., Rabehl, G., Sension, D., & Lundin,
S. Domain referenced curriculum evaluation: A technical
handbook and a case study from the MINNEMAST
project. CSE Monograph Series in Evaluation, Volume 1.
Center for the Study of Evaluation, University of
California, Los Angeles, 1973.

Keller, C.M. Criterion referenced measurement: A
bibliography. ERIC/TM Report 7, 1972. Princeton, New
Jersey: ERIC Clearinghouse on Tests, Measurement, and
Evaluation, 1972. ED 060 041.

Klein, S.P. Evaluating tests in terms of the information they
provide. Evaluation Comment, 1970, 2 (2), 1-6. ED 045
699.

Klein, S.P. An evaluation of New Mexico's educational
priorities. Paper presented at Western Psychological
Association, Portland, 1972. TM 002 735. (ED number
not yet available.)

Kosecoff, J.B., & Klein, S.P. Analyzing tests and test items
for sensitivity to instructional effects. CSE Working Paper
No. 24. Center for the Study of Evaluation, University of
California, Los Angeles, 1973.

Kriewall, T.E., & Hirsch, E. The development and
interpretation of criterion referenced tests. Paper presented at
Annual AERA Meeting, Los Angeles, California, 1969.
ED 042 815.

Mager, R.F. Preparing instructional objectives. San
Francisco: Fearon Publishers, Inc., 1962.



*Items followed by an ED number (for example ED 069 762) are
available from the ERIC Document Reproduction Service (EDRS).
Consult the most recent issue of Research in Education for the
address and ordering information.



Millman, J. Passing scores and test lengths for domain
referenced measures. Paper presented at Annual AERA
Meeting, Chicago, 1972. ED 065 555.

Nitko, A.J. A model for criterion referenced tests based on
use. Paper presented at Annual AERA Meeting, New
York, 1971. ED 049 318.

Nitko, A.J. Problems in the development of criterion
referenced tests. Paper presented at Annual AERA
Meeting, New Orleans, 1973.

Ozenne, D.O. Toward an evaluative methodology for
criterion referenced measures: Test sensitivity. CSE Report
72, Center for the Study of Evaluation, University of
California, Los Angeles, 1971. ED 061 263.

Popham, W.J. The teacher-empiricist: A curriculum and
instruction supplement. Los Angeles: Lennox-Brown,
Inc., 1965.

Popham, W., & Husek, T.R. Implications of criterion
referenced measurement. Journal of Educational
Measurement, 1969, 6 (1), 1-9.

Popham, W. Indices of adequacy for criterion referenced
test items. Presentation at Joint Session of NCME and
AERA, Minneapolis, Minnesota, 1970.

Popham, W.J. Selecting objectives and generating test items
for objectives based tests. Paper presented at Conference
on Problems in Objectives Based Measurement, Center for
the Study of Evaluation, University of California, Los
Angeles, 1972.



Roudabush, G. Some reliability problems in a criterion
referenced test. Paper presented at Annual AERA
Meeting, New York, 1971. ED 050 144.

Roudabush, G.E. Item selection for criterion referenced
tests. Paper presented at Annual AERA Meeting, New
Orleans, 1973. ED 074 147.

Skager, R. Generating criterion referenced tests from
objectives based assessment systems: Unsolved problems
in test development, assembly and interpretation. Paper
presented at Annual AERA Meeting, New Orleans, 1973.

Wilson, H.A. A humanistic approach to criterion referenced
testing. Paper presented at Annual AERA Meeting, New
Orleans, 1973.

Zweig, R., & Associates. Personal communication, March 
15, 1973. 

Selected References on Test Item Construction 

Ebel, Robert L. Essentials of educational measurement.
Englewood Cliffs, New Jersey: Prentice-Hall Inc., 1972.

Gronlund, N.E. Constructing achievement tests. Englewood
Cliffs, New Jersey: Prentice-Hall, 1968.

Wood, Dorothy A. Test construction. Columbus, Ohio:
Merrill, 1961.



